Objectives: This work aims at uncovering challenges in biomedical knowledge representation research by providing an understanding of what was historically called "medical concept representation" and used as the name for a working group of the International Medical Informatics Association. Methods: Bibliometrics, text mining, and a social media survey compare the research done in this area between two periods, before and after 2000. Results: Both the opinion of socially active groups of researchers and the interpretation of bibliometric data since 1988 suggest that the focus of research has moved from "medical concept representation" to "medical ontologies". Conclusions: It remains debatable whether the observed change amounts to a paradigm shift or whether it simply reflects changes in naming, following the natural evolution of ontology research and engineering activities in the 1990s. The availability of powerful tools to handle ontologies devoted to certain areas of biomedicine has not resulted in a large-scale breakthrough beyond advances in basic research.

I. Introduction

The study of the meaning of language expressions has a long history in health informatics, both regarding narratives (e.g., text in clinical reports and from the biomedical literature) and structured information (e.g., terms from standard vocabularies used for clinical research, health statistics, quality assessment and billing). It motivated the activities of the Working Group on Medical Concept Representation (MCR WG) of the International Medical Informatics Association (IMIA) [1], which was an influential body in the late 1980s and the 1990s, publishing regular overviews [2].

The evolution of ontologies for biomedical research, the proliferation of clinical vocabularies, and advances in human language technologies with increasingly large amounts of training data have changed the health information science landscape profoundly. New scientific communities, such as the Semantic Web community, have arisen, and social media are changing communication between researchers. In this context, the MCR WG, now renamed "Language and Meaning in Biomedicine (LaMB)", will have to find a new ecological niche. To better define the future activities of this working group, the authors have investigated the evolution of the field of biomedical language and representation of meaning over the years, and discuss some persistent research areas to be addressed in the future.

II. Methods 

The analysis of literature over time can provide insight into how a research field develops [3]. We used bibliographies, on-line text mining tools and a social media survey tool to investigate how the research area known as "Medical Knowledge Representation" has evolved since the 1990s.

The phrase "medical concept representation" (not to be confused with "concept representation" as a category used in psychology) was key in that period, which is why the working group was named accordingly. We therefore placed this phrase at the centre of our investigation, which was divided into the following steps:

• Time line analysis of the occurrence of the phrase "medical concept representation" using the Scopus term analyser [4], extraction of the contextual environment using Ultimate Research Assistant [5], and visualization of the results using a tag cloud [6];

• Identification of the authors of the most influential papers with the tool Publish or Perish [7], using seven sources, viz. Web of Science, Scopus, Embase, PubMed, Google Scholar, Cochrane Library, and the British Library on-line catalogue. The aim was to gauge the persistence of the influential authors from the first period to the second. The Boolean search expression "concept representation" AND ("medical" OR "medicine") AND ("knowledge" OR "information") was submitted to all of them, with variations according to their proprietary syntax. To identify the top ten papers, the results of the seven lists were consolidated into a common table. Where available, citation ranks were used; otherwise the source's own ranking mechanism was used. The top ten papers were then the source for extracting the top thirty authors, who were ranked in a second step using the following heuristic: the n-th author in the list was assigned a score of 11 - n, and the eleventh and following authors were given a score of zero. The scoring was weighted to favour multiple appearances of authors in different sources: the final score was calculated as net score × (0.8 + 0.2 × occurrence).

• In the post-2000 analysis, the usage of the exact phrase "medical concept representation" had dropped so significantly that the resulting paper population would have been too small for applying the same procedure as described for the first period. Therefore, instead of summing up the citation data only for papers matching the query, the citation data for all papers per author were used. This method, however, could not be applied backwards to the previous period, due to limitations of the tool used [7].

• The hypothesis of a paradigm shift was studied by comparing relevant papers published from 1988 to 1999 with those appearing between 2000 and 2012 in the same subject area. The reason for starting with 1988 was the availability of bibliographic databases, which roughly coincides with the period of our interest, viz. the activities of the IMIA WG on Medical Concept Representation. Author lists were compared, and all the titles of the two full paper sets were text mined using Textalyser [8].

• The second, more recent set was cross-checked against a third set from the same period, obtained by an online survey targeted at a specifically interested audience. For this survey (open from August to October 2012), the primary source was the LinkedIn group of the MCR WG, which at that time had over fifty members from widely varying backgrounds. Secondary sources were additional LinkedIn groups in broader domains. Participants were asked to quote and share the papers they found most influential in their work or research. We used Datagle [9] and a Google document to collect survey data.
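The author-scoring heuristic from the second step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not spell out how position scores are aggregated, so the sketch assumes that an author's net score is the sum of the position scores (11 - n) over all source lists, and that "occurrence" is the number of source lists in which the author appears. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def position_score(n: int) -> int:
    """Score of the n-th author (1-based) in a ranked list: 11 - n, zero from the eleventh on."""
    return max(11 - n, 0)


def final_scores(ranked_lists: dict[str, list[str]]) -> dict[str, float]:
    """Weighted author scores: net score x (0.8 + 0.2 x occurrence).

    Assumption (not stated in the paper): the net score sums position scores
    over all source lists, and occurrence counts the lists naming the author.
    Each author is assumed to appear at most once per list.
    """
    net = defaultdict(int)
    occurrence = defaultdict(int)
    for authors in ranked_lists.values():
        for n, author in enumerate(authors, start=1):
            net[author] += position_score(n)
            occurrence[author] += 1
    return {a: round(net[a] * (0.8 + 0.2 * occurrence[a]), 1) for a in net}


# Hypothetical input: two source lists with placeholder author names.
sample = {
    "source_a": ["Author A", "Author B", "Author C"],
    "source_b": ["Author B", "Author A"],
}
scores = final_scores(sample)
for author, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:5.1f}  {author}")
```

With this weighting, an author found in two sources receives a 20% bonus over the plain sum, which is what favours multiple appearances across sources.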

III. Results 

1. Looking Back: 'Medical Concept Representation' before the Turn of the Millennium

Scopus has revealed that the exact phrase "medical concept representation" was used mostly in the nineties (Figure 1). Scopus data were available for 1993-2008. The targeted semantic search revealed a wide conceptual domain related to this phrase, as shown in Figure 2.

Figure 1. Scopus time line analytics results for the exact phrase "medical concept representation".

Figure 2. Wordle tag cloud generated from the result of the catchphrase search using Ultimate Research Assistant [5].

The top thirty authors of the ten most influential papers from 1988-1999 were identified (the starting date of the study was justified by the availability of electronic bibliographic databases and the comparability of the investigated periods before and after 2000). The tool Publish or Perish [7] showed the average number of authors to be 2.45. The results of the extraction of the first three author names per paper are shown in Table 1. Our querying strategy was found effective for excluding papers regarded as irrelevant for our purpose, e.g., those in the domain of concept representation in psychology.

A frequency analysis of the title words of the papers in the same period shows the most frequently used uni- and bi-grams (single noun phrases and meaningful two-word phrases) in Table 2. Note that 'ontology' was not among the most frequently used terms in that period.
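A title-word frequency analysis of this kind can be approximated with the Python standard library. The sketch below is not the Textalyser procedure, whose internals are not described here. It assumes simple lower-casing, a small hand-picked stop-word list, and bi-grams built from adjacent words after stop-word removal, so words separated only by a stop word are counted as a bi-gram. The three titles are made-up sample input, not titles from the study.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "with"}


def tokenize(title: str) -> list[str]:
    """Lower-case a title and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS]


def ngram_counts(titles: list[str], n: int) -> Counter:
    """Count n-grams (n=1: single words, n=2: two-word phrases) over all titles."""
    counts = Counter()
    for title in titles:
        tokens = tokenize(title)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Made-up sample titles, for illustration only.
titles = [
    "Medical language processing and knowledge representation",
    "Knowledge representation for medical language",
    "A model of medical concept representation",
]

print(ngram_counts(titles, 1).most_common(3))
print(ngram_counts(titles, 2).most_common(3))
```

Ranking the resulting counts and keeping the ten most frequent entries per column gives a table of the same shape as Table 2.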

2. "Medical Concept Representation" Since the Year 2000

Table 3 presents the list of the top thirty authors of the most cited publications, using the same Boolean expression applied to the period 2000-2012. However, as the methodology was different for the reasons explained above, the comparison should be interpreted with reservation. Nevertheless, it is striking that the two lists overlap in only three authors (in bold). In addition, the word frequency analysis of the period 2000-2012 shows a clearly distinct result (Table 4).
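The overlap between the two author rankings can be checked with a plain set intersection. The sketch below copies the author names from Tables 1 and 3. Where the extracted tables show damaged spellings, the normalized forms used here (for example "Johnson SB", "Joubert M", "Wagner JC", "Noy NF") are assumptions.

```python
# Author names as listed in Table 1 (1988-1999); damaged spellings normalized by assumption.
authors_1988_1999 = {
    "Cimino JJ", "Oliver DE", "Baud RH", "Scherrer JR", "Rector AL",
    "Bell DS", "Shahar Y", "Shortliffe EH", "Fieschi M", "Huff SM",
    "Joubert M", "Volot F", "Johnson SB", "Evans DA", "Hersh WR",
    "Rosse C", "Miller RA", "Rassinoux AM", "Musen MA", "Nowlan WA",
    "Wagner JC", "Bailey KR", "Bauer BA", "Elkin PL", "Chute CG",
    "Schoolman HM", "Barnett GO", "Horrocks I", "Humphreys BL", "Lindberg DAB",
}

# Author names as listed in Table 3 (2000-2012).
authors_2000_2012 = {
    "Smith B", "Roberts A", "Stevens R", "Horrocks I", "Van Harmelen F",
    "Fensel D", "Zadeh LA", "Goble C", "Heilman KM", "Decker S",
    "Friedman C", "Pal SK", "Musen MA", "Aspden P", "Rosse C",
    "Noy NF", "Nadeau SE", "Joffe H", "Wroe C", "Lussier Y",
    "Coronado S", "Saraceno C", "Sioutos N", "Yao YY", "Shagina L",
    "Hartel FW", "Mejino JR", "Haber MW", "Shiu SCK", "Steinman F",
}

# Authors present in both rankings, and their share of a thirty-name list.
shared = sorted(authors_1988_1999 & authors_2000_2012)
share = len(shared) / len(authors_2000_2012)

print(shared)           # ['Horrocks I', 'Musen MA', 'Rosse C']
print(f"{share:.0%}")   # 10%
```

Exact string matching is a deliberately naive choice; spelling variants of the same author would need to be reconciled before a comparison on larger lists.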



Table 1. The thirty most influential authors of the period 1988-1999 that used the phrase 'medical concept representation'

Score   Authors (1-15)    Score   Authors (16-30)
83.2    Cimino JJ         22.8    Rosse C
43.2    Oliver DE         21.0    Miller RA
39.2    Baud RH           21.0    Rassinoux AM
39.2    Scherrer JR       19.2    Musen MA
35.0    Rector AL         19.2    Nowlan WA
33.6    Bell DS           18.2    Wagner JC
33.6    Shahar Y          18.0    Bailey KR
33.6    Shortliffe EH     18.0    Bauer BA
30.8    Fieschi M         18.0    Elkin PL
30.8    Huff SM           16.8    Chute CG
30.8    Joubert M         15.6    Schoolman HM
30.8    Volot F           15.6    Barnett GO
26.6    Johnson SB        15.6    Horrocks I
22.8    Evans DA          15.6    Humphreys BL
22.8    Hersh WR          15.6    Lindberg DAB




3. Mapping the Conceptual Context of Most Influential Papers Based on Text Mining of Titles

Figure 3 shows how the terms in the titles changed. "Old" terms that are no longer found among the "new" top ten are depicted in white. New terms appearing in the 2000-2012 list are shown in red. The top ten terms also suggest that the subject matter of "concept representation" was broadened (from focusing on "medical" to areas such as "health" and "clinical"). In addition, the words "semantics" and "ontology" suggest that new ideas have influenced the concept representation domain. The fact that "language", "model" and "terminology" disappeared may suggest that some more differentiated areas branched off the previously common roots.

Figure 3. Changes in the most frequent title words of papers on medical concept representation (words by frequency rank, 1988-1999 versus 2000-2012).

Table 2. List of most frequently used uni- and bi-grams of the period 1988-1999

Rank   Words            Word bi-grams
1      knowledge        medical language
2      language         natural language
3      concept          case based
4      clinical         knowledge representation
5      terminology      knowledge acquisition
6      data             language processing
7      representation   medical concept
8      information      medical terminology
9      model            structured data
10     system           concept representation

Table 3. Set of most cited authors, between 2000 and 2012, covering the whole domain of all authors publishing on medical concept representation

Authors (1-15)    Cited      Authors (16-30)   Cited
Smith B           125,229    Noy NF            21,195
Roberts A         62,871     Nadeau SE         20,886
Stevens R         62,715     Joffe H           19,204
Horrocks I*       60,177     Wroe C            17,530
Van Harmelen F    58,626     Lussier Y         14,802
Fensel D          58,491     Coronado S        14,105
Zadeh LA          49,420     Saraceno C        6,976
Goble C           46,984     Sioutos N         6,584
Heilman KM        45,858     Yao YY            5,516
Decker S          41,092     Shagina L         5,407
Friedman C        40,473     Hartel FW         3,211
Pal SK            31,200     Mejino JR         2,975
Musen MA*         28,181     Haber MW          2,325
Aspden P          23,482     Shiu SCK          1,611
Rosse C*          22,253     Steinman F        1,074

Names that are in the 1988-1999 ranking are in bold face (rendered here with a trailing asterisk).

Table 4. List of most frequently used uni- and bi-grams of the period 2000-2012 in the domain of medical concept representation

Rank   Words             Word bi-grams
1      health            health informatics
2      information*      electronic health
3      clinical*         natural language*
4      knowledge*        concept based
5      ontology          decision support
6      case              language processing*
7      data*             concept representation*
8      semantic(s)       medical language*
9      concept*          medical informatics
10     representation*   description logic

Bold face (rendered here with a trailing asterisk) highlights the terms that also occur in the top-ten list from the 1988-1999 period (Table 2).

4. Results of the Survey Taken Show the Opinion of Socially Active Researchers Interested in the Domain

The survey had 42 respondents. Not surprisingly, the central role of ontologies is clearly reflected in the list of the twenty most influential papers (Table 5). Recurring resources include the Open Biological and Biomedical Ontologies (OBO) Foundry [10], the Gene Ontology [11], Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) [12], and the Unified Medical Language System (UMLS) [13].
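The comparison of the top-ten title words of the two periods (Tables 2 and 4, summarized in Figure 3) amounts to a set comparison plus a rank difference. The following sketch reproduces it from the word columns of the two tables. The variable names are illustrative, and the rank-shift measure is an addition for illustration, not something reported in the study.

```python
# Word columns of Table 2 (1988-1999) and Table 4 (2000-2012), in rank order.
words_1988_1999 = ["knowledge", "language", "concept", "clinical", "terminology",
                   "data", "representation", "information", "model", "system"]
words_2000_2012 = ["health", "information", "clinical", "knowledge", "ontology",
                   "case", "data", "semantic(s)", "concept", "representation"]

old, new = set(words_1988_1999), set(words_2000_2012)

dropped = sorted(old - new)   # in the earlier top ten only
added = sorted(new - old)     # in the later top ten only
kept = sorted(old & new)      # in both top-ten lists

# Positive values mean the word moved up in rank in the later period.
rank_shift = {
    w: (words_1988_1999.index(w) + 1) - (words_2000_2012.index(w) + 1)
    for w in kept
}

print("dropped:", dropped)
print("added:  ", added)
print("shift:  ", rank_shift)
```

On these two lists, the set difference for the earlier period contains "language", "model", "system" and "terminology", and the words new to the later period are "case", "health", "ontology" and "semantic(s)".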

IV. Discussion 

1. Methodology Issues Regarding the Literature Study 

Although the methodology applied in this paper does not aim at establishing a new scientometric index or a generalizable tool, it clearly demonstrated that on-line searchable library databases, bibliometric services, and simple text mining tools enable the creation of study-focused tool sets as used in this study without investing much effort and resources. Using multiple, large bibliographic source databases helped to alleviate the possible bias in such studies that are limited to one particular source or aspect of the field.

Table 5. Titles of the twenty most influential papers as listed by LinkedIn MCR WG members

Rank(a)  Title
1        The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration (849) [10]
2        The Unified Medical Language System (UMLS): integrating biomedical terminology (709) [13]
2        A reference ontology for biomedical informatics: the Foundational Model of Anatomy (705) [14]
2        Relations in biomedical ontologies (672) [15]
2        Desiderata for controlled medical vocabularies in the twenty-first century (457) [16]
2        Clinical terminology: why is it so hard? (234) [17]
2        From concepts to clinical reality: an essay on the benchmarking of biomedical terminologies (81) [18]
2        BioCaster: detecting public health rumors with a Web-based text mining system (63) [19]
3        Gene ontology: tool for the unification of biology (10,008) [11]
3        Sweetening ontologies with DOLCE (668) [20]
3        The medical dictionary for regulatory activities (MedDRA) (198) [21]
3        SNOMED clinical terms: overview of the development process and project status (150) [22]
3        Towards a reference terminology for ontology research and development in the biomedical domain (102) [23]
3        Methods in biomedical ontology (92) [24]
3        Ontology-based error detection in SNOMED CT (82) [25]
3        Fuzzy health, illness, and disease (60) [26]
3        Modeling biomedical experimental processes with OBI (58) [27]
3        Bringing epidemiology into the Semantic Web (1) [28]
3        A dictionary of epidemiology (book) [29]
3        Semantic interoperability for better health and safer healthcare [30]

MCR WG: Working Group on Medical Concept Representation, OBO: Open Biological and Biomedical Ontologies, SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms, OBI: Ontology for Biomedical Investigations.

(a) Frequency ranking is based on incidence in survey lists. The first ranked paper was mentioned the most times in various lists. Papers with rank '2' shared the second highest number of occurrences, and so on. Papers with the same rank are in order of their citation frequency, shown above in parentheses.

2. Current Trends 

The tools we used in this study were aimed at exploring the specific area of medical concept representation, with a focus on testing whether the observed changes amount to a significant paradigm shift.

Our results show that researchers active in this area for several decades have pursued the main goal of making health-related information machine readable and processable. This has been a major driver of the development of clinical information systems in general. The use of formal languages, such as description logics, has been a step in this direction. In the 1990s, "medical concept representation" was seen as a solution by proposing just one general method: practical conceptualization of information in medical research and practice. However, these efforts were hindered by theoretical issues, difficulties of modelling a domain, and the explosion of knowledge in general [31].

Building on this background, our investigation has taken the pulse of a group of researchers interested in what we could refer to, generally speaking, as the study of meaning of structured and unstructured representations. First of all, the use of the term "concept" has decreased, which we attribute to the following factors:

• Propagation of the paradigm of ontological realism, the proponents of which have been arguing against the usage of this word in the context of ontologies, contending that the representation of concepts as "entities of thought" is inappropriate for the representation of a scientific domain and obfuscates the difference between the entities and names given to them [32];

• The preference of "class" over "concept" in the Semantic Web and description logics community, especially regarding the influential OWL family of representation languages [33];

• The obvious polysemy of the word itself [34].

In addition, the popularity of the word "ontology" shows a new tendency in which artefacts that represent types of domain entities are more clearly distinguished by some researchers from artefacts that describe language items. The importance of ontology-based artefacts can be seen by the central place the OBO Foundry and SNOMED CT occupy in publications and importance judgments. However, the boundaries between ontologies and knowledge representation artefacts are less clear, although relatively crisp criteria can be formulated. In practice, "ontology" is used by many to refer to a wide array of resources across the semantic spectrum, encompassing terminologies, thesauri, classifications and formal ontologies [35].

At the same time, areas such as medical language processing and medical terminologies, but also metadata, semantic annotation and folksonomies, have gained importance, so that they are no longer subsumed under "concept representation".

The analysis of influential authors faced methodological difficulties, as the selection criterion, namely the phrase "concept representation", turned out to be a moving target. The comparability of the two lists of authors is therefore limited. Nevertheless, it is noteworthy that only three authors appeared on both lists. This comparison is additionally biased: it is very likely that relevant authors in the second period were not retrieved simply because they did not use the already outmoded phrase "concept representation" at all. Some authors of the papers in Table 5 are not among the top thirty (Table 3) simply because they avoid that phrase. Had they been included, the overlap would probably have been even lower.

V. Conclusion 

There are several indications that the turn of the new millennium coincided with a change in the focus of research in medical domain representation and semantics. The millennium marked the establishment of applied ontology [36] and the Semantic Web [37] as new disciplines. The central role of the term "concept" has been gradually abandoned. Whether this really amounts to a paradigm shift or to a simple change in terminological preferences is open to debate. Undoubtedly, the ontology research and engineering efforts, which started around 1990, yielded important results, including the development of description logics [38], tools like Protégé [39], as well as the groundbreaking GALEN project [40].

The following directions for the future have emerged from 
our analysis: 

• The capture of medical information and knowledge leverages (standards) ontologies;

• Open reference resources for content are developed collaboratively, shared, and reused;

• Web enabled standards help achieve transparent results;

• "Big data" opens new ways for knowledge acquisition;

• However, a large part of clinical information continues to be recorded as free text, which keeps the need for processing medical language on the research agenda.

All these topics justify, more than ever, collaborative research and development efforts, for which the IMIA WG Language and Meaning in Biomedicine (LaMB) [41] can be an effective catalyst.